Assignment instructions:

Assignment submission Yos Ramirez: ______________________________________


library(tidyverse)
library(here)
library(janitor)
library(estimatr)  
library(performance)
library(jtools)
library(gt)
library(gtsummary)
library(MASS) ## NOTE: The `select()` function is masked. Use: `dplyr::select()` ##
library(interactions) 

DATA SOURCE:

Reed D. 2019. SBC LTER: Reef: Abundance, size and fishing effort for California Spiny Lobster (Panulirus interruptus), ongoing since 2012. Environmental Data Initiative. https://doi.org/10.6073/pasta/a593a675d644fdefb736750b291579a0. Dataset accessed 11/17/2019.


Introduction

You’re about to dive into data collected from five reef sites in Santa Barbara County, all about the abundance of California spiny lobsters! 🦞 Data were gathered by divers annually from 2012 to 2018 across the Naples, Mohawk, Isla Vista, Carpinteria, and Arroyo Quemado reefs.

Why lobsters? Well, this sample provides an opportunity to evaluate the impact of Marine Protected Areas (MPAs) established on January 1, 2012 (Reed, 2019). Of these five reefs, Naples and Isla Vista are MPAs, while the other three are not protected (non-MPAs). Comparing lobster abundance between these protected and non-protected areas gives us the chance to study how commercial and recreational fishing might impact these ecosystems.

We will consider the MPA sites the treatment group and use regression methods to explore whether protecting these reefs really makes a difference compared to non-MPA sites (our control group). In this assignment, we’ll think deeply about which causal inference assumptions hold up under the research design and identify where they fall short.

Let’s break it down step by step and see what the data reveals! 📊


Step 1: Anticipating potential sources of selection bias

a. Do the control sites (Arroyo Quemado, Carpinteria, and Mohawk) provide a strong counterfactual for our treatment sites (Naples, Isla Vista)? Write a paragraph making a case for why this comparison is ceteris paribus or whether selection bias is likely (be specific!).


Step 2: Read & wrangle data

a. Read in the raw data. Name the data.frame (df) rawdata

b. Use the function clean_names() from the janitor package

# HINT: check for coding of missing values (`na = "-99999"`)
# Read in data, clean column names
rawdata <- read_csv(here("data/spiny_abundance_sb_18.csv"), na = "-99999") %>%
                        clean_names()
# Look at first few rows
head(rawdata)
## # A tibble: 6 × 10
##    year month date       site  transect replicate size_mm count num_ao  area
##   <dbl> <dbl> <date>     <chr>    <dbl> <chr>       <dbl> <dbl>  <dbl> <dbl>
## 1  2012     8 2012-08-20 IVEE         1 A              NA     0      0   300
## 2  2012     8 2012-08-20 IVEE         1 B              NA     0      0   300
## 3  2012     8 2012-08-20 IVEE         1 C              NA     0      0   300
## 4  2012     8 2012-08-20 IVEE         1 D              NA     0      0   300
## 5  2012     8 2012-08-20 IVEE         2 A              NA     0      0   300
## 6  2012     8 2012-08-20 IVEE         2 B              NA     0      0   300
# Check if there are any missing values in the entire dataset
sum(is.na(rawdata))
## [1] 350

c. Create a new df named tidydata. Using the variable site (reef location), create a new variable reef as a factor and add the following labels in the order listed (i.e., re-order the levels):

"Arroyo Quemado", "Carpinteria", "Mohawk", "Isla Vista", "Naples"
# Create dataframe with site variable
tidydata <- rawdata %>%
  mutate(reef = factor(site, levels = c("AQUE", "CARP", "MOHK", "IVEE", "NAPL"), labels = c("Arroyo Quemado", "Carpinteria", "Mohawk", "Isla Vista", "Naples")))
# Look at the first few rows  
head(tidydata) 
## # A tibble: 6 × 11
##    year month date       site  transect replicate size_mm count num_ao  area
##   <dbl> <dbl> <date>     <chr>    <dbl> <chr>       <dbl> <dbl>  <dbl> <dbl>
## 1  2012     8 2012-08-20 IVEE         1 A              NA     0      0   300
## 2  2012     8 2012-08-20 IVEE         1 B              NA     0      0   300
## 3  2012     8 2012-08-20 IVEE         1 C              NA     0      0   300
## 4  2012     8 2012-08-20 IVEE         1 D              NA     0      0   300
## 5  2012     8 2012-08-20 IVEE         2 A              NA     0      0   300
## 6  2012     8 2012-08-20 IVEE         2 B              NA     0      0   300
## # ℹ 1 more variable: reef <fct>

Create a new df named spiny_counts

d. Create a new variable counts to allow for an analysis of lobster counts, where the unit of observation is the total number of observed lobsters per site, year, and transect.

e. Create a new variable mpa with levels MPA and non_MPA. For our regression analysis, create a numerical variable treat where MPA sites are coded 1 and non_MPA sites are coded 0.

#HINT(d): Use `group_by()` & `summarize()` to provide the total number of lobsters observed at each site-year-transect row-observation. 

#HINT(e): Use `case_when()` to create the new variable columns

# Create dataframe with counts, mean_size, and grouping
spiny_counts <- tidydata %>%
  # Group by site, year, and transect
  group_by(site, year, transect) %>%
  # Summarize the total number of lobsters observed
  summarize(counts = sum(count, na.rm = TRUE), 
            mean_size = mean(size_mm, na.rm = TRUE), 
            .groups = "drop") # Ungroup to avoid carry-over

# Create 'mpa' and 'treat' variables using case_when
spiny_counts <- spiny_counts %>%
  mutate(
    # Create 'mpa' based on site
    mpa = case_when(
      site %in% c("NAPL", "IVEE") ~ "MPA",   
      TRUE ~ "non_MPA"                             
    ),
    # Create numeric 'treat' variable (1 for MPA, 0 for non-MPA)
    treat = case_when(
      mpa == "MPA" ~ 1,  
      TRUE ~ 0           
    )
  )

# View the first few rows to ensure everything looks good
head(spiny_counts)
## # A tibble: 6 × 7
##   site   year transect counts mean_size mpa     treat
##   <chr> <dbl>    <dbl>  <dbl>     <dbl> <chr>   <dbl>
## 1 AQUE   2012        1      5      64.2 non_MPA     0
## 2 AQUE   2012        2      9      66   non_MPA     0
## 3 AQUE   2012        3      0     NaN   non_MPA     0
## 4 AQUE   2012        4      9      74.1 non_MPA     0
## 5 AQUE   2012        5     11      76.9 non_MPA     0
## 6 AQUE   2012        6      0     NaN   non_MPA     0

NOTE: This step is crucial to the analysis. Check with a friend or come to TA/instructor office hours to make sure the counts are coded correctly!
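One quick way to check the coding yourself (a sketch only, using a small simulated stand-in for `tidydata`): after grouping and summing, the transect-level totals should add back up to the total number of lobsters in the raw rows.

```r
# Sketch of a sanity check for step 2(d), using simulated stand-in data
library(dplyr)

toy <- data.frame(
  site     = rep(c("AQUE", "IVEE"), each = 4),
  year     = 2012,
  transect = rep(1:2, 4),
  count    = c(1, 2, NA, 0, 3, 1, 2, NA)
)

toy_counts <- toy %>%
  group_by(site, year, transect) %>%
  summarize(counts = sum(count, na.rm = TRUE), .groups = "drop")

# Totals should match if the grouping/summing is coded correctly
sum(toy_counts$counts) == sum(toy$count, na.rm = TRUE)  # TRUE
```

The same comparison on `spiny_counts` vs. `tidydata` is a fast way to catch grouping mistakes before office hours.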


Step 3: Explore & visualize data

a. Take a look at the data! Get familiar with the data in each df format (tidydata, spiny_counts)

b. We will focus on the variables count, year, site, and treat (mpa) to model lobster abundance. Create the following 4 plots, using a different method each time from the 6 options provided. Add a layer (geom) to each plot with informative descriptive statistics (you choose; e.g., mean, median, SD, quartiles, range). Make sure each plot dimension is clearly labeled (e.g., axes, groups).

Create plots displaying the distribution of lobster counts:

  1. grouped by reef site
  2. grouped by MPA status
  3. grouped by year

Create a plot of lobster size:

  1. You choose the grouping variable(s)!
# Plot 1: Density Plot of lobster counts by reef site
spiny_counts %>% 
  ggplot(aes(x = counts, fill = site)) + 
  geom_density(alpha = 0.5) +  # Density plot with transparency
  geom_vline(data = spiny_counts %>% # Add statistics line
               group_by(site) %>% # Group by site
               summarize(mean_count = mean(counts, na.rm = TRUE)), # Mean counts
             aes(xintercept = mean_count, color = site), # Mean counts by site
             linetype = "dashed", size = 1) +  # Add vertical dashed lines for means
  labs( # Labels
    title = "Density Plot of Lobster Counts (with mean) by Reef Site", # Title
    x = "Lobster Count", # X label
    y = "Density" # Y label
  ) +
  theme_minimal() + # Theme
  theme(legend.position = "bottom") # Legend for site

# Plot 2: Histogram of lobster counts by MPA status
spiny_counts %>% 
  ggplot(aes(x = counts, fill = mpa)) +  # Use 'counts' and fill by 'mpa'
  geom_histogram(binwidth = 5, alpha = 0.6, position = "dodge") +  # Create histogram with specified bin width, transparency, and position
  geom_vline(data = spiny_counts %>% # Add statistics line
                 group_by(mpa) %>% # Group by mpa
                 summarize(median_count = median(counts, na.rm = TRUE)), # Median counts 
             aes(xintercept = median_count, color = mpa), # Median counts by mpa
             linetype = "dashed", size = 1) +  # Add vertical lines for median
    labs( # Labels
    title = "Histogram of Lobster Counts (with median) by MPA Status", # Title
    x = "Lobster Count", # X label
    y = "Frequency" # Y label
  ) +
  scale_fill_manual(values = c("non_MPA" = "skyblue", "MPA" = "violet")) +  # Set custom colors
  theme_minimal() + # Theme
  theme(legend.position = "bottom") # Legend for mpa

# Plot 3: Jitter plot of lobster counts by year with median line
spiny_counts %>%
  ggplot(aes(x = year, y = counts, color = as.factor(year))) + 
  geom_jitter(width = 0.2, height = 0, alpha = 0.6) +  # Add jitter to avoid overlapping points
  geom_hline(data = spiny_counts %>% # Add statistics line
               group_by(year) %>% # Group by year
               summarize(median_count = median(counts, na.rm = TRUE)), # Median counts
             aes(yintercept = median_count, color = as.factor(year)), # Median counts by year
             linetype = "dashed", size = 1) +  # Add horizontal lines for median values
  labs( # Labels
    title = "Jitter Plot of Lobster Counts (with median) by Year", # Title
    x = "Year", # X label
    y = "Lobster Count" # Y label
  ) +
  theme_minimal() + # Theme
  theme(legend.position = "none") # No legend

# Plot 4: Violin plot of lobster size grouped by MPA status
spiny_counts %>%
ggplot(aes(x = mpa, y = mean_size, fill = mpa)) +
  geom_violin() +  # Create the violin plot
  stat_summary(fun = "mean", geom = "point", color = "black", size = 2) +  # Add a point for the mean
  labs(title = "Violin Plot of Lobster Size (with mean) by MPA Status", # Labels for title, x and y axis
       x = "MPA Status",
       y = "Lobster Mean Size (mm)") +
    scale_fill_manual(values = c("non_MPA" = "skyblue", "MPA" = "violet")) + # Custom colors
  theme_minimal() + # Theme
  theme(legend.position = "none") # No legend

c. Compare means of the outcome by treatment group using the tbl_summary() function from the gtsummary package.

# USE: gtsummary::tbl_summary()
# Create a summary table
spiny_counts %>%
  tbl_summary(by = treat) %>%  # Group by 'treat' (0 = non-MPA, 1 = MPA)
  add_p()  # Add a p-value to check if the difference is statistically significant
Characteristic    0, N = 133¹    1, N = 119¹    p-value²
site                                            <0.001
  AQUE            49 (37%)       0 (0%)
  CARP            63 (47%)       0 (0%)
  IVEE            0 (0%)         56 (47%)
  MOHK            21 (16%)       0 (0%)
  NAPL            0 (0%)         63 (53%)
year                                            >0.9
  2012            19 (14%)       17 (14%)
  2013            19 (14%)       17 (14%)
  2014            19 (14%)       17 (14%)
  2015            19 (14%)       17 (14%)
  2016            19 (14%)       17 (14%)
  2017            19 (14%)       17 (14%)
  2018            19 (14%)       17 (14%)
transect                                        0.7
  1               21 (16%)       14 (12%)
  2               21 (16%)       14 (12%)
  3               21 (16%)       14 (12%)
  4               14 (11%)       14 (12%)
  5               14 (11%)       14 (12%)
  6               14 (11%)       14 (12%)
  7               14 (11%)       14 (12%)
  8               7 (5.3%)       14 (12%)
  9               7 (5.3%)       7 (5.9%)
counts            9 (4, 22)      13 (3, 34)     0.3
mean_size         73 (68, 77)    77 (71, 80)    <0.001
  Unknown         15             12
mpa                                             <0.001
  MPA             0 (0%)         119 (100%)
  non_MPA         133 (100%)     0 (0%)

¹ n (%); Median (Q1, Q3)
² Pearson’s Chi-squared test; Wilcoxon rank sum test

Step 4: OLS regression- building intuition

a. Start with a simple OLS estimator of lobster counts regressed on treatment. Use the function summ() from the jtools package to print the OLS output

b. Interpret the intercept & predictor coefficients in your own words. Use full sentences and write your interpretation of the regression results to be as clear as possible to a non-academic audience.

# NOTE: We will not evaluate/interpret model fit in this assignment (e.g., R-squared)
# OLS model
m1_ols <- lm(counts ~ treat, data = spiny_counts)
# OLS output
summ(m1_ols, model.fit = FALSE) 
Observations: 252
Dependent variable: counts
Type: OLS linear regression

                 Est.   S.E.   t val.      p
(Intercept)     22.73   3.57     6.36   0.00
treat            5.36   5.20     1.03   0.30

Standard errors: OLS
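To build intuition for part (b): with a single binary predictor, OLS simply recovers the two group means. A minimal sketch with simulated counts (not the lobster data) shows that the intercept equals the control-group mean and intercept + treat equals the treatment-group mean:

```r
# Sketch: OLS with one binary predictor recovers group means (simulated data)
set.seed(42)
toy <- data.frame(
  treat  = rep(c(0, 1), each = 50),
  counts = c(rpois(50, 20), rpois(50, 28))
)
fit <- lm(counts ~ treat, data = toy)

# Intercept = mean outcome in the control group (treat == 0)
all.equal(unname(coef(fit)["(Intercept)"]), mean(toy$counts[toy$treat == 0]))
# Intercept + treat coefficient = mean outcome in the treated group
all.equal(unname(sum(coef(fit))), mean(toy$counts[toy$treat == 1]))
```

So in the output above, 22.73 is the average count at non-MPA sites, and 22.73 + 5.36 is the average at MPA sites.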

c. Check the model assumptions using the check_model function from the performance package

d. Explain the results of the 4 diagnostic plots. Why are we getting these results?

check_model(m1_ols,  check = "qq" ) # Q-Q plot for normality of residuals

check_model(m1_ols, check = "normality") # Normality of residuals for normal curve

check_model(m1_ols, check = "homogeneity") # Homogeneity of variance with fitted values

check_model(m1_ols, check = "pp_check") # Posterior predictive check 


Step 5: Fitting GLMs

a. Estimate a Poisson regression model using the glm() function

b. Interpret the predictor coefficient in your own words. Use full sentences and write your interpretation of the results to be as clear as possible to a non-academic audience.

c. Explain the statistical concept of dispersion and overdispersion in the context of this model.

d. Compare results with the previous model and explain the change in the significance of the treatment effect.

#HINT1: Incidence Rate Ratio (IRR): exponentiating the beta coefficient returns the IRR, which is interpreted as the 'percent change' in the outcome for a one-unit increase in the predictor

#HINT2: For the second glm() argument `family` use the following specification option `family = poisson(link = "log")`
# GLM model with poisson distribution
m2_pois <- glm(counts ~ treat, family = poisson(link = "log"), data = spiny_counts)
# Summary of model
summary(m2_pois)
## 
## Call:
## glm(formula = counts ~ treat, family = poisson(link = "log"), 
##     data = spiny_counts)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.12366    0.01819 171.744   <2e-16 ***
## treat        0.21184    0.02510   8.441   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 10438  on 251  degrees of freedom
## Residual deviance: 10366  on 250  degrees of freedom
## AIC: 11366
## 
## Number of Fisher Scoring iterations: 6
exp(coef(m2_pois)) # Exponents of model coefficients
## (Intercept)       treat 
##   22.729323    1.235956
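Following HINT1, the exponentiated treatment coefficient above (~1.236) is an incidence rate ratio; converting it to a percent change is just (IRR - 1) * 100:

```r
# Sketch: converting the Poisson treatment coefficient to a percent change
beta_treat <- 0.21184            # `treat` estimate from the output above
irr        <- exp(beta_treat)    # incidence rate ratio, ~1.236
pct_change <- (irr - 1) * 100    # percent change in expected counts
round(pct_change, 1)             # ~23.6% more lobsters at MPA sites
```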
# Calculate deviance and degrees of freedom
deviance(m2_pois) / df.residual(m2_pois)
## [1] 41.46596
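For part (c), the dispersion ratio that check_overdispersion() reports can be computed by hand as the Pearson chi-squared statistic divided by the residual degrees of freedom. A sketch with simulated overdispersed counts (standing in for the lobster data):

```r
# Sketch: Pearson dispersion statistic for a Poisson fit (simulated data)
set.seed(1)
toy <- data.frame(treat = rep(c(0, 1), each = 100))
toy$counts <- rnbinom(200, size = 0.6, mu = ifelse(toy$treat == 1, 28, 23))

fit_pois <- glm(counts ~ treat, family = poisson(link = "log"), data = toy)

# Pearson chi-squared / residual df; values far above 1 => overdispersion
dispersion_ratio <- sum(residuals(fit_pois, type = "pearson")^2) /
  df.residual(fit_pois)
dispersion_ratio > 1  # TRUE for these overdispersed counts
```

Under a well-specified Poisson model the ratio should be near 1, since the Poisson variance is assumed equal to the mean.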
# OLS model summary
summary(m1_ols)
## 
## Call:
## lm(formula = counts ~ treat, data = spiny_counts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.092 -21.729 -13.729   4.271 242.908 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   22.729      3.573   6.361 9.49e-10 ***
## treat          5.363      5.200   1.031    0.303    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.21 on 250 degrees of freedom
## Multiple R-squared:  0.004237,   Adjusted R-squared:  0.0002538 
## F-statistic: 1.064 on 1 and 250 DF,  p-value: 0.3034
# Poisson regression model summary
summary(m2_pois)
## 
## Call:
## glm(formula = counts ~ treat, family = poisson(link = "log"), 
##     data = spiny_counts)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.12366    0.01819 171.744   <2e-16 ***
## treat        0.21184    0.02510   8.441   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 10438  on 251  degrees of freedom
## Residual deviance: 10366  on 250  degrees of freedom
## AIC: 11366
## 
## Number of Fisher Scoring iterations: 6

e. Check the model assumptions. Explain results.

f. Conduct tests for over-dispersion & zero-inflation. Explain results.

check_model(m2_pois) # Model assumptions 

check_overdispersion(m2_pois) # Overdispersion test
## # Overdispersion test
## 
##        dispersion ratio =    67.033
##   Pearson's Chi-Squared = 16758.289
##                 p-value =   < 0.001
check_zeroinflation(m2_pois) # Zero inflation test
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 0
##             Ratio: 0.00
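The predicted-zero count above comes from the fitted Poisson model itself: sum each observation's probability of a zero at its fitted mean. A sketch with simulated counts (not the lobster data) shows why a Poisson fit to overdispersed data predicts essentially no zeros:

```r
# Sketch: expected zeros under a Poisson fit (simulated overdispersed counts)
set.seed(1)
toy_counts <- rnbinom(200, size = 0.6, mu = 25)  # many zeros, heavy tail
fit <- glm(toy_counts ~ 1, family = poisson)

# Probability of zero at each fitted mean, summed over observations
expected_zeros <- sum(dpois(0, lambda = fitted(fit)))

c(observed = sum(toy_counts == 0), predicted = round(expected_zeros))
# Observed zeros are common, but the Poisson fit predicts ~0 of them
```

With a fitted mean in the 20s, a Poisson model puts essentially zero probability on a count of 0, which is why the observed/predicted ratio above is 0.00.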

g. Fit a negative binomial model using the function glm.nb() from the package MASS and check model diagnostics

h. In 1-2 sentences explain rationale for fitting this GLM model.

i. Interpret the treatment estimate result in your own words. Compare with results from the previous model.

# NOTE: The `glm.nb()` function does not require a `family` argument
# Negative binomial model
m3_nb <- glm.nb(counts ~ treat, data = spiny_counts)
# Summary of model
summary(m3_nb)
## 
## Call:
## glm.nb(formula = counts ~ treat, data = spiny_counts, init.theta = 0.5500333101, 
##     link = log)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.1237     0.1183  26.399   <2e-16 ***
## treat         0.2118     0.1720   1.232    0.218    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.55) family taken to be 1)
## 
##     Null deviance: 302.18  on 251  degrees of freedom
## Residual deviance: 300.66  on 250  degrees of freedom
## AIC: 2088.5
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.5500 
##           Std. Err.:  0.0466 
## 
##  2 x log-likelihood:  -2082.5280
check_overdispersion(m3_nb) # Overdispersion test
## # Overdispersion test
## 
##  dispersion ratio = 1.398
##           p-value = 0.088
check_zeroinflation(m3_nb) # Zero inflation test
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 30
##             Ratio: 1.12
check_predictions(m3_nb) # Posterior predictive check 

check_model(m3_nb) # Model assumptions


Step 6: Compare models

a. Use the export_summs() function from the jtools package to look at the three regression models you fit side-by-side.

b. Write a short paragraph comparing the results. Is the treatment effect robust or stable across the model specifications?

# Look at 3 regression models to compare
export_summs(
    m1_ols,  # OLS model
    m2_pois,  # Poisson regression model
    m3_nb,    # Negative Binomial model
    model.names = c("OLS","Poisson", "NB"),
    statistics = "none")
                  OLS          Poisson       NB
(Intercept)    22.73 ***     3.12 ***     3.12 ***
               (3.57)        (0.02)       (0.12)
treat           5.36         0.21 ***     0.21
               (5.20)        (0.03)       (0.17)

*** p < 0.001; ** p < 0.01; * p < 0.05.

Step 7: Building intuition - fixed effects

a. Create new df with the year variable converted to a factor

b. Run the following negative binomial model using glm.nb()

c. Take a look at the regression output. Each coefficient provides a comparison, or the difference in means, for a specific sub-group in the data. Informally, describe what the model has estimated at a conceptual level (NOTE: you do not have to interpret coefficients individually)

d. Explain why the main effect for treatment is negative. Does this result make sense?

# Create a new data frame with year as factor
ff_counts <- spiny_counts %>% 
    mutate(year=as_factor(year))
# Run negative binomial model for counts and treat, add fixed year, add interaction for treat*year
m5_fixedeffs <- glm.nb(
    counts ~ 
        treat +
        year +
        treat*year,
    data = ff_counts)
# Summarize model
summ(m5_fixedeffs, model.fit = FALSE)
Observations: 252
Dependent variable: counts
Type: Generalized linear model
Family: Negative Binomial(0.8129)
Link: log

                   Est.   S.E.   z val.      p
(Intercept)        2.35   0.26     8.89   0.00
treat             -1.72   0.42    -4.12   0.00
year2013          -0.35   0.38    -0.93   0.35
year2014           0.08   0.37     0.21   0.84
year2015           0.86   0.37     2.32   0.02
year2016           0.90   0.37     2.43   0.01
year2017           1.56   0.37     4.25   0.00
year2018           1.04   0.37     2.81   0.00
treat:year2013     1.52   0.57     2.66   0.01
treat:year2014     2.14   0.56     3.80   0.00
treat:year2015     2.12   0.56     3.79   0.00
treat:year2016     1.40   0.56     2.50   0.01
treat:year2017     1.55   0.56     2.77   0.01
treat:year2018     2.62   0.56     4.69   0.00

Standard errors: MLE
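One way to see why the treat main effect is negative in this specification (relevant to part (d)): with year fixed effects and interactions, treat is the MPA vs. non-MPA difference in the reference year (2012) on the log scale. Back-transforming the estimates above gives the implied 2012 means:

```r
# Sketch: back-transforming the 2012 (reference-year) estimates above
b0 <- 2.35    # (Intercept): log mean count, non-MPA sites in 2012
b1 <- -1.72   # treat: log difference, MPA vs. non-MPA, in 2012

exp(b0)       # ~10.5 lobsters per transect at non-MPA sites in 2012
exp(b0 + b1)  # ~1.9 lobsters per transect at MPA sites in 2012
```

So the negative main effect simply reflects that MPA sites started 2012 with far fewer lobsters; the positive treat:year interactions capture their gains in later years.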

e. Look at the model predictions: Use the interact_plot() function from package interactions to plot mean predictions by year and treatment status.

f. Re-evaluate your responses to (c) and (d) above.

# Plot the interaction between treatment and year
interact_plot(m5_fixedeffs, pred = year, modx = treat,
              outcome.scale = "response") # NOTE: y-axis on the response (counts) scale

# HINT: Change `outcome.scale` to "response" to convert y-axis scale to counts

g. Using ggplot() create a plot in same style as the previous interaction plot, but displaying the original scale of the outcome variable (lobster counts). This type of plot is commonly used to show how the treatment effect changes across discrete time points (i.e., panel data).

The plot should have:

  - year on the x-axis
  - counts on the y-axis
  - mpa as the grouping variable

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor
# Group the data by 'year' and 'mpa', then calculate the mean count for each group
plot_counts <- ff_counts %>%
  group_by(year, mpa) %>%
  summarize(mean_count = mean(counts, na.rm = TRUE)) %>%
  ungroup()
# Convert 'year' to a factor
plot_counts$year <- as.factor(plot_counts$year)
# plot_counts %>% ggplot() ...
# Create the plot
plot_counts %>%
  ggplot(aes(x = year, y = mean_count, color = mpa, group = mpa)) +
  geom_line(size = 1) +  # Line plot to show trends over time
  geom_point(size = 3) +  # Points for each year
  labs( # Labels for title, x and y axis
    title = "Lobster Counts by MPA Status and Year",
    x = "Year",
    y = "Mean Lobster Count",
    color = "MPA Status" # Set color by mpa
  ) +
  scale_color_manual(
      values = c("non_MPA" = "skyblue", "MPA" = "violet")) +  # Custom colors for MPA vs non-MPA
  theme_minimal() + # Theme
  theme(legend.position = "bottom") # Legend


Step 8: Reconsider causal identification assumptions

  1. Discuss whether you think spillover effects are likely in this research context (see Glossary of terms; https://docs.google.com/document/d/1RIudsVcYhWGpqC-Uftk9UTz3PIq6stVyEpT44EPNgpE/edit?usp=sharing)
  2. Explain why spillover is an issue for the identification of causal effects
  3. How does spillover relate to impact in this research setting?
  4. Discuss the following causal inference assumptions in the context of the MPA treatment effect estimator. Evaluate whether each assumption is reasonable:

    1. SUTVA: Stable Unit Treatment Value assumption
    2. Excludability assumption

EXTRA CREDIT

Use the recent lobster abundance data with observations collected up until 2024 (lobster_sbchannel_24.csv) to run an analysis evaluating the effect of MPA status on lobster counts using the same focal variables.

  1. Create a new script for the analysis on the updated data

Read in data, clean data, and create dataframe with counts and mean lobster size

# Read in data, clean column names
rawdata24 <- read_csv(here("data/lobster_sbchannel_24.csv"), na = "-99999") %>%
                        clean_names()
tidydata24 <- rawdata24 %>%
  mutate(reef = factor(site, levels = c("AQUE", "CARP", "MOHK", "IVEE", "NAPL"), labels = c("Arroyo Quemado", "Carpinteria", "Mohawk", "Isla Vista", "Naples")))
# Create dataframe with counts, mean_size, and grouping
spiny_counts24 <- tidydata24 %>%
  # Group by site, year, and transect
  group_by(site, year, transect) %>%
  # Summarize the total number of lobsters observed
  summarize(counts = sum(count, na.rm = TRUE), 
            mean_size = mean(size_mm, na.rm = TRUE), 
            .groups = "drop") # Ungroup to avoid carry-over

# Create 'mpa' and 'treat' variables using case_when
spiny_counts24 <- spiny_counts24 %>%
  mutate(
    # Create 'mpa' based on site
    mpa = case_when(
      site %in% c("NAPL", "IVEE") ~ "MPA",   
      TRUE ~ "non_MPA"                             
    ),
    # Create numeric 'treat' variable (1 for MPA, 0 for non-MPA)
    treat = case_when(
      mpa == "MPA" ~ 1,  
      TRUE ~ 0           
    )
  )

  2. Run at least 3 regression models & assess model diagnostics

The following regression models show some differences relative to the 2012-2018 sample analyzed above. This dataset includes observations through 2024, adding six more years of data.

# OLS model
m1_ols24 <- lm(counts ~ treat, data = spiny_counts24)
# Poisson regression model
m2_pois24 <- glm(counts ~ treat, family = poisson(link = "log"), data = spiny_counts24)
# Negative binomial model
m3_nb24 <- glm.nb(counts ~ treat, data = spiny_counts24)
# OLS model summary
print(summary(m1_ols24))
## 
## Call:
## lm(formula = counts ~ treat, data = spiny_counts24)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.986 -25.986 -14.268   7.764 236.014 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   27.268      2.687  10.149   <2e-16 ***
## treat          7.718      3.910   1.974    0.049 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.14 on 464 degrees of freedom
## Multiple R-squared:  0.008326,   Adjusted R-squared:  0.006189 
## F-statistic: 3.896 on 1 and 464 DF,  p-value: 0.049
# Poisson regression model summary
print(summary(m2_pois24))
## 
## Call:
## glm(formula = counts ~ treat, family = poisson(link = "log"), 
##     data = spiny_counts24)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.30572    0.01221  270.75   <2e-16 ***
## treat        0.24923    0.01670   14.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 19797  on 465  degrees of freedom
## Residual deviance: 19574  on 464  degrees of freedom
## AIC: 21530
## 
## Number of Fisher Scoring iterations: 6
# Negative binomial model summary
print(summary(m3_nb24))
## 
## Call:
## glm.nb(formula = counts ~ treat, data = spiny_counts24, init.theta = 0.5769416479, 
##     link = log)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.30572    0.08482  38.972   <2e-16 ***
## treat        0.24923    0.12330   2.021   0.0432 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.5769) family taken to be 1)
## 
##     Null deviance: 562.24  on 465  degrees of freedom
## Residual deviance: 558.14  on 464  degrees of freedom
## AIC: 4058
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.5769 
##           Std. Err.:  0.0362 
## 
##  2 x log-likelihood:  -4052.0430
# Check models assumptions
check_model(m1_ols24)

check_model(m2_pois24)

check_model(m3_nb24)

check_overdispersion(m3_nb24) # Overdispersion test
## # Overdispersion test
## 
##  dispersion ratio = 1.035
##           p-value = 0.808
check_zeroinflation(m3_nb24) # Zero inflation test
## # Check for zero-inflation
## 
##    Observed zeros: 51
##   Predicted zeros: 47
##             Ratio: 0.91

  3. Compare and contrast results with the analysis from the 2012-2018 data sample (~2 paragraphs)
# Create a new data frame with year as factor
ff_counts24 <- spiny_counts24 %>% 
    mutate(year=as_factor(year))
# Run negative binomial model for counts and treat, add fixed year, add interaction for treat*year
m5_fixedeffs24 <- glm.nb(
    counts ~ 
        treat +
        year +
        treat*year,
    data = ff_counts24)
# Plot the interaction between treatment and year
interact_plot(m5_fixedeffs24, pred = year, modx = treat,
              outcome.scale = "response") # NOTE: y-axis on the response (counts) scale